Goto

Collaborating Authors

 linear function


How does feature learning reshape the function space?

arXiv.org Machine Learning

Feature learning is widely regarded as the key mechanism distinguishing neural networks from fixed-kernel methods, yet its impact on the induced function space remains poorly understood. In this work, we precisely characterize how the function space spanned by the features of a two-layer neural network evolves during gradient descent training. We prove that, in the high-dimensional proportional regime, after a large gradient step the post-update feature distribution is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space and modifies its spectral structure. Our analysis reveals that feature learning can be interpreted as a distributional transformation in either parameter space or input space, equivalently as the introduction of a target-dependent kernel. In particular, it selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling the top radial mode with a target-aligned quadratic harmonic. Overall, our results provide a precise function-space perspective on early-stage feature learning: rather than just rescaling a fixed kernel, gradient descent induces a data-adaptive deformation that preferentially enhances directions aligned with the signal in the data.


3d36c07721a0a5a96436d6c536a132ec-Supplemental.pdf

Neural Information Processing Systems

Figure S1: Estimated Networks 1 & 3 from linear factor models of DS (Top) and Granger causality (Bottom) for simulated data experiment. Each panel shows a grid of DS or Granger causality (GC) features associated with the indicated network estimate. Within each grid, a plot corresponds to signal that is being transmitted from the channel listed on the left to the channel listed at the top. See Figure 1 for a description of the true networks. Each subplot represents the DS from the region listed on the left to the region listed on top. Power spectra are reasonable to model using a linear factor model because they satisfy Definition 1 under reasonable assumptions. We will use Scc(ω) to refer to the spectral power of the signal vc(t) at frequency ω, and vc(ω) to refer to the frequency domain representation of vc(t) at ω.




Overcoming the Convex Barrier for Simplex Inputs

Neural Information Processing Systems

Recent progress in neural network verification has challenged the notion of a convex barrier, that is, an inherent weakness in the convex relaxation of the output of a neural network. Specifically, there now exists a tight relaxation for verifying the robustness of a neural network to ` input perturbations, as well as efficient primal and dual solvers for the relaxation. Buoyed by this success, we consider the problem of developing similar techniques for verifying robustness to input perturbations within the probability simplex. We prove a somewhat surprising result that, in this case, not only can one design a tight relaxation that overcomes the convex barrier, but the size of the relaxation remains linear in the number of neurons, thereby leading to simpler and more efficient algorithms. We establish the scalability of our overall approach via the specification of `1 robustness for CIFAR-10 and MNIST classification, where our approach improves the state of the art verified accuracy by up to 14.4%. Furthermore, we establish its accuracy on a novel and highly challenging task of verifying the robustness of a multi-modal (text and image) classifier to arbitrary changes in its textual input.



Linear regression without correspondence

Neural Information Processing Systems

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d.


Estimating Learnability in the Sublinear Data Regime

Neural Information Processing Systems

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this ``learnability'' even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a $d$-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with $O(\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. We extend these techniques to a binary classification, and show that the prediction error of the best linear classifier can be accurately estimated given $O(\sqrt{d})$ labeled samples. For comparison, in both the linear regression and binary classification settings, even if there is no noise in the labels, a sample size linear in the dimension, $d$, is required to \emph{learn} any function correlated with the underlying model. We further extend our estimation approach to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. We demonstrate the practical viability of our approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.